Speaker Diarization
Probabilistic Fusion and Calibration of Neural Speaker Diarization Models
Alvarez-Trejos, Juan Ignacio, Balanya, Sergio A., Ramos, Daniel, Lozano-Diez, Alicia
End-to-End Neural Diarization (EEND) systems produce frame-level probabilistic speaker activity estimates, yet since evaluation focuses primarily on Diarization Error Rate (DER), the reliability and calibration of these confidence scores have been largely neglected. When fusing multiple diarization systems, DOVER-Lap remains the only established approach, operating at the segment level with hard decisions. We propose working with continuous probability outputs, which enables more sophisticated fusion and calibration techniques that can leverage model uncertainty and complementary strengths across different architectures. This paper presents the first comprehensive framework for calibrating and fusing EEND models at the probability level. We investigate two output formulations (multilabel and powerset representations) and their impact on calibration and fusion effectiveness. Through extensive experiments on the CallHome two-speaker benchmark, we demonstrate that proper calibration provides substantial improvements even for individual models (up to 19% relative DER reduction), in some cases mitigating the absence of domain adaptation. We reveal that joint calibration in powerset space consistently outperforms independent per-speaker calibration, that fusion substantially improves over individual models, and that the Fuse-then-Calibrate ordering generally outperforms both calibrating before fusion and uncalibrated fusion while requiring calibration of only a single combined model. Our best configuration outperforms DOVER-Lap in terms of DER while providing reliable confidence estimates essential for downstream applications. This work proposes best practices for probability-level fusion of EEND systems and demonstrates the advantages of leveraging soft outputs over hard decisions.
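As a concrete illustration of working at the probability level, the sketch below averages two EEND models' frame-level speaker posteriors in logit space and then fits a single temperature on held-out frames, mirroring the Fuse-then-Calibrate ordering. The array shapes, the averaging rule, and the grid-search calibrator are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of probability-level fusion followed by temperature
# calibration of the single combined model (Fuse-then-Calibrate).
import numpy as np

def logit(p, eps=1e-6):
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse(probs_a, probs_b):
    """Average two (frames, speakers) posteriors in logit space."""
    return sigmoid(0.5 * (logit(probs_a) + logit(probs_b)))

def fit_temperature(probs, labels, grid=np.linspace(0.25, 4.0, 64)):
    """Pick the temperature minimizing binary cross-entropy on held-out frames."""
    z = logit(probs)
    best_t, best_nll = 1.0, np.inf
    for t in grid:
        p = np.clip(sigmoid(z / t), 1e-6, 1.0 - 1e-6)
        nll = -np.mean(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))
        if nll < best_nll:
            best_t, best_nll = t, nll
    return best_t

# Stand-in dev-set data: 0/1 speaker-activity labels and two systems' posteriors.
rng = np.random.default_rng(0)
labels = (rng.random((1000, 2)) < 0.3).astype(float)
probs_a = np.clip(0.7 * labels + 0.3 * rng.random((1000, 2)), 0.01, 0.99)
probs_b = np.clip(0.6 * labels + 0.4 * rng.random((1000, 2)), 0.01, 0.99)
fused = fuse(probs_a, probs_b)
temperature = fit_temperature(fused, labels)     # one calibrator for the fusion
calibrated = sigmoid(logit(fused) / temperature)
```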
LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization
We introduce LibriConvo, a simulated multi-speaker conversational dataset based on speaker-aware conversation simulation (SASC), designed to support training and evaluation of speaker diarization and automatic speech recognition (ASR) systems. Unlike prior resources that mostly rely on semantically disconnected utterances and implausible temporal gaps, LibriConvo ensures semantic coherence and realistic conversational timing. Our pipeline leverages CallHome with external VAD for reliable boundaries, applies compression to reduce unnaturally long silences, and organizes LibriTTS utterances by book to maintain contextual consistency. Acoustic realism is enhanced via a novel room impulse response selection procedure that ranks speaker-microphone configurations by spatial plausibility, balancing realism and diversity. The dataset comprises 240.1 hours across 1,496 dialogues with 830 unique speakers, split in a speaker-disjoint manner for robust evaluation. Baselines show that the Sortformer model outperforms the pyannote pipeline in diarization, while a fine-tuned Fast Conformer-CTC XLarge with Serialized Output Training achieves 7.29% WER for ASR, surpassing zero-shot Whisper-large-v3. LibriConvo provides a valuable resource for advancing multi-speaker speech processing research with realistic conversational dynamics and controlled experimental conditions.
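A toy version of the simulation idea is sketched below: two speakers' utterances are kept in reading order for coherence, and inter-turn gaps are drawn from a capped distribution so silences stay plausible. The gap model and durations are stand-ins, not LibriConvo's fitted timing statistics.

```python
# Minimal conversation-simulation sketch producing RTTM-style turns.
import random

def simulate_dialogue(utts_a, utts_b, max_gap=2.0, seed=0):
    """utts_a/utts_b: utterance durations in seconds, in reading order.
    Returns (speaker, onset, offset) tuples in time order."""
    rng = random.Random(seed)
    turns, t = [], 0.0
    queues = {"A": list(utts_a), "B": list(utts_b)}
    speaker = "A"
    while queues["A"] or queues["B"]:
        if not queues[speaker]:
            speaker = "B" if speaker == "A" else "A"
        dur = queues[speaker].pop(0)
        turns.append((speaker, t, t + dur))
        # Cap the sampled gap so unnaturally long silences are compressed.
        t += dur + min(rng.expovariate(1.5), max_gap)
        if rng.random() < 0.6:  # usually alternate speakers
            speaker = "B" if speaker == "A" else "A"
    return turns

for spk, on, off in simulate_dialogue([3.1, 2.4, 4.0], [2.2, 3.5]):
    print(f"{spk} {on:.2f} {off:.2f}")
```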
M3-SLU: Evaluating Speaker-Attributed Reasoning in Multimodal Large Language Models
Kwon, Yejin, Kang, Taewoo, Yoon, Hyunsoo, Kim, Changouk
We present M3-SLU, a new multimodal large language model (MLLM) benchmark for evaluating multi-speaker, multi-turn spoken language understanding. While recent models show strong performance in speech and text comprehension, they still struggle with speaker-attributed reasoning, the ability to understand who said what and when in natural conversations. M3-SLU is built from four open corpora (CHiME-6, MELD, MultiDialog, and AMI) and comprises over 12,000 validated instances with paired audio, transcripts, and metadata. It includes two tasks: (1) Speaker-Attributed Question Answering and (2) Speaker Attribution via Utterance Matching. We provide baseline results for both cascaded pipelines and end-to-end MLLMs, evaluated using an LLM-as-Judge and accuracy metrics. Results show that while models can capture what was said, they often fail to identify who said it, revealing a key gap in speaker-aware dialogue understanding. M3-SLU serves as a challenging benchmark to advance research in speaker-aware multimodal understanding.
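To make the task format concrete, here is one plausible shape for a Speaker-Attributed QA instance together with a plain exact-match accuracy check; the field names are hypothetical, not M3-SLU's actual schema.

```python
# Hypothetical instance layout and a simple accuracy metric for task (1).
from dataclasses import dataclass

@dataclass
class SAQAInstance:
    audio_path: str
    transcript: list[tuple[str, str]]  # (speaker_id, utterance) pairs
    question: str                      # e.g. "Who suggested moving the meeting?"
    answer: str                        # gold speaker or content answer

def accuracy(instances, predict):
    """predict: callable mapping an instance to a model answer string."""
    hits = sum(predict(x).strip().lower() == x.answer.strip().lower()
               for x in instances)
    return hits / len(instances)
```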
Benchmarking Diarization Models
Lanzendörfer, Luca A., Grötschla, Florian, Blaser, Cesare, Wattenhofer, Roger
Speaker diarization is the task of partitioning audio into segments according to speaker identity, answering the question of "who spoke when" in multi-speaker conversation recordings. While diarization is an essential task for many downstream applications, it remains an unsolved problem. Errors in diarization propagate to downstream systems and cause wide-ranging failures. To this end, we examine exact failure modes by evaluating five state-of-the-art diarization models across four diarization datasets spanning multiple languages and acoustic conditions. The evaluation datasets consist of 196.6 hours of multilingual audio, including English, Mandarin, German, Japanese, and Spanish. Overall, we find that PyannoteAI achieves the best performance at 11.2% DER, while DiariZen provides a competitive open-source alternative at 13.3% DER. When analyzing failure cases, we find that the primary cause of diarization errors stems from missed speech segments, followed by speaker confusion, especially in high-speaker-count settings.
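Since the failure analysis hinges on decomposing DER into its components, a self-contained frame-level decomposition is sketched below. It assumes hypothesis speaker labels are already mapped to reference speakers and ignores scoring collars and overlap handling, which real toolkits take care of.

```python
# Minimal frame-level split of diarization error into miss / false alarm /
# confusion; None marks silence, and each frame carries one speaker label.
def der_components(reference, hypothesis):
    assert len(reference) == len(hypothesis)
    miss = fa = conf = 0
    speech = sum(r is not None for r in reference)
    for r, h in zip(reference, hypothesis):
        if r is not None and h is None:
            miss += 1          # missed speech
        elif r is None and h is not None:
            fa += 1            # false alarm
        elif r is not None and r != h:
            conf += 1          # speaker confusion
    return {"miss": miss / speech, "fa": fa / speech,
            "confusion": conf / speech,
            "der": (miss + fa + conf) / speech}

ref = ["s1", "s1", None, "s2", "s2", "s2"]
hyp = ["s1", None, "s1", "s2", "s1", "s2"]
print(der_components(ref, hyp))
# {'miss': 0.2, 'fa': 0.2, 'confusion': 0.2, 'der': 0.6}
```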
Domain-Aware Speaker Diarization On African-Accented English
Okocha, Chibuzor, Ezema, Kelechi, Grant, Christan
This study examines domain effects in speaker diarization for African-accented English. We evaluate multiple production and open systems on general and clinical dialogues under a strict DER protocol that scores overlap. A consistent domain penalty appears for clinical speech and remains significant across models. Error analysis attributes much of this penalty to false alarms and missed detections, aligning with short turns and frequent overlap. We test lightweight domain adaptation by fine-tuning a segmentation module on accent-matched data; it reduces error but does not eliminate the gap. Our contributions include a controlled benchmark across domains, a concise approach to error decomposition and conversation-level profiling, and an adaptation recipe that is easy to reproduce. Results point to overlap-aware segmentation and balanced clinical resources as practical next steps.
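The general shape of such a segmentation fine-tune, with the model and data as placeholders rather than the paper's actual system, might look like this:

```python
# Sketch: fine-tune a frame-level segmentation head on accent-matched audio
# with binary cross-entropy over per-frame speaker-activity targets.
import torch
import torch.nn as nn

class SegmentationHead(nn.Module):
    def __init__(self, feat_dim=80, hidden=128, max_speakers=3):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, max_speakers)

    def forward(self, feats):              # feats: (batch, frames, feat_dim)
        h, _ = self.rnn(feats)
        return self.out(h)                 # logits: (batch, frames, max_speakers)

model = SegmentationHead()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)  # small LR for adaptation
loss_fn = nn.BCEWithLogitsLoss()

# One illustrative step on random stand-ins for accent-matched batches.
feats = torch.randn(4, 500, 80)
targets = (torch.rand(4, 500, 3) < 0.2).float()
opt.zero_grad()
loss = loss_fn(model(feats), targets)
loss.backward()
opt.step()
```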
Mitigating Intra-Speaker Variability in Diarization with Style-Controllable Speech Augmentation
Kim, Miseul, Park, Soo Jin, Byun, Kyungguen, Shin, Hyeon-Kyeong, Moon, Sunkuk, Zhang, Shuhua, Visser, Erik
Speaker diarization systems are vulnerable to intra-speaker variability, where a speaker's vocal characteristics shift within a single recording. This can cause segments from the same speaker to be misclassified as different individuals, for example, when one raises their voice or speaks faster during conversation. To address this, we propose a style-controllable speech generation model that augments speech across diverse styles while preserving the target speaker's identity. The proposed system starts with diarized segments from a conventional diarizer. For each diarized segment, it generates augmented speech samples enriched with phonetic and stylistic diversity. Speaker embeddings from both the original and generated audio are then blended to enhance the system's robustness in grouping segments with high intrinsic intra-speaker variability.
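The blending step can be pictured as a weighted average of L2-normalized embeddings; the blend weight and the embedding extractor below are illustrative assumptions, not the paper's tuned setup.

```python
# Sketch: blend embeddings of the original segment and its style-augmented
# copies so clustering sees an embedding with style variation averaged out.
import numpy as np

def blended_embedding(embed_fn, original_wav, augmented_wavs, alpha=0.5):
    """embed_fn maps a waveform to an L2-normalized speaker embedding."""
    e_orig = embed_fn(original_wav)
    e_aug = np.mean([embed_fn(w) for w in augmented_wavs], axis=0)
    e = alpha * e_orig + (1.0 - alpha) * e_aug
    return e / np.linalg.norm(e)
```

The blended vector then replaces the original segment embedding before the usual clustering stage.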
Early Exit and Multi Stage Knowledge Distillation in VLMs for Video Summarization
Khan, Anas Anwarul Haq, Verma, Utkarsh, Ramakrishnan, Ganesh
We introduce DEEVISum (Distilled Early Exit Vision language model for Summarization), a lightweight, efficient, and scalable vision language model designed for segment-wise video summarization. Leveraging multimodal prompts that combine textual and audio-derived signals, DEEVISum incorporates Multi Stage Knowledge Distillation (MSKD) and Early Exit (EE) to strike a balance between performance and efficiency. MSKD offers a 1.33% absolute F1 improvement over baseline distillation (0.5%), while EE reduces inference time by approximately 21% with a 1.3 point drop in F1. Evaluated on the TVSum dataset, our best model, PaLI Gemma2 3B + MSKD, achieves an F1 score of 61.1, rivaling the performance of significantly larger models while maintaining a lower computational footprint. We publicly release our code and processed dataset to support further research.
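Confidence-based early exit of the kind EE denotes can be sketched as follows; the layer placement, threshold, and toy transformer stack are assumptions, not DEEVISum's configuration.

```python
# Sketch: attach a classifier head to an intermediate layer and stop the
# forward pass once its prediction is confident enough.
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    def __init__(self, dim=256, n_layers=6, n_classes=2, exit_at=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(n_layers))
        self.exit_at = exit_at
        self.early_head = nn.Linear(dim, n_classes)
        self.final_head = nn.Linear(dim, n_classes)

    def forward(self, x, threshold=0.9):
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i + 1 == self.exit_at:
                probs = self.early_head(x.mean(dim=1)).softmax(-1)
                if probs.max() >= threshold:   # confident: skip later layers
                    return probs, i + 1
        return self.final_head(x.mean(dim=1)).softmax(-1), len(self.layers)

model = EarlyExitEncoder().eval()
with torch.no_grad():
    probs, depth = model(torch.randn(1, 16, 256))
print(f"exited after {depth} layers")
```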
Unifying Diarization, Separation, and ASR with Multi-Speaker Encoder
Shakeel, Muhammad, Sudo, Yui, Peng, Yifan, Lin, Chyi-Jiunn, Watanabe, Shinji
This paper presents a unified multi-speaker encoder (UME), a novel architecture that jointly learns representations for speaker diarization (SD), speech separation (SS), and multi-speaker automatic speech recognition (ASR) tasks using a shared speech foundational encoder. We leverage the hidden representations from multiple layers of UME as a residual weighted-sum encoding (RWSE) to effectively use information from different semantic levels, contributing to bottom-up alignment between tasks. Our evaluations demonstrate that UME substantially improves over the single-task baselines dedicated to SD, SS, and multi-speaker ASR on LibriMix evaluation sets. Notably, for SD, UME outperforms previous studies, achieving diarization error rates of 1.37% and 2.29% on the Libri2Mix and Libri3Mix evaluation sets, respectively.
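The weighted-sum idea in miniature: learn one softmax-normalized scalar per encoder layer and mix the layers' hidden states, so each downstream head can draw on a different semantic level. Dimensions are illustrative; UME's actual encoder and task heads differ.

```python
# Sketch of a learnable weighted sum over encoder layers.
import torch
import torch.nn as nn

class WeightedSum(nn.Module):
    def __init__(self, n_layers):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(n_layers))  # one scalar per layer

    def forward(self, hidden_states):
        # hidden_states: list of (batch, frames, dim) tensors, one per layer
        weights = self.w.softmax(dim=0)
        stacked = torch.stack(hidden_states, dim=0)   # (layers, B, T, D)
        return (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)

layers = [torch.randn(2, 100, 512) for _ in range(12)]
fused = WeightedSum(12)(layers)  # (2, 100, 512), fed to a SD/SS/ASR head
```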
SPGISpeech 2.0: Transcribed multi-speaker financial audio for speaker-tagged transcription
Grossman, Raymond, Park, Taejin, Dhawan, Kunal, Titus, Andrew, Zhi, Sophia, Shchadilova, Yulia, Wang, Weiqing, Balam, Jagadeesh, Ginsburg, Boris
We introduce SPGISpeech 2.0, a dataset suitable for speaker-tagged transcription in the financial domain. SPGISpeech 2.0 improves the diversity of applicable modeling tasks while maintaining the core characteristic of the original SPGISpeech dataset: audio snippets and their corresponding fully formatted text transcriptions, usable for end-to-end automatic speech recognition (ASR). SPGISpeech 2.0 consists of 3,780 additional hours of professionally transcribed earnings calls. Furthermore, the dataset contains call and speaker information for each audio snippet, facilitating multi-talker ASR. We validate the utility of SPGISpeech 2.0 through improvements in speaker-tagged ASR performance of popular speech recognition models after fine-tuning on SPGISpeech 2.0. Released free for non-commercial use, we expect SPGISpeech 2.0 to foster advancements in speech recognition technologies and inspire a wide range of research applications.
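Speaker-tagged training targets for such data might be serialized as below; the tag format is an illustrative assumption rather than SPGISpeech 2.0's exact convention.

```python
# Sketch: turn (speaker, start_time, text) turns into one speaker-tagged
# target string, in the spirit of serialized multi-talker transcription.
def serialize(turns):
    pieces, spk_tags = [], {}
    for spk, _, text in sorted(turns, key=lambda t: t[1]):
        tag = spk_tags.setdefault(spk, f"<spk{len(spk_tags)}>")
        pieces.append(f"{tag} {text}")
    return " ".join(pieces)

print(serialize([("alice", 0.0, "Good morning, everyone."),
                 ("bob", 4.2, "Thanks, turning to slide two.")]))
# -> "<spk0> Good morning, everyone. <spk1> Thanks, turning to slide two."
```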
Enhancing Speech Emotion Recognition Leveraging Aligning Timestamps of ASR Transcripts and Speaker Diarization
Wang, Hsuan-Yu, Lee, Pei-Ying, Chen, Berlin
In this paper, we investigate the impact of incorporating timestamp-based alignment between Automatic Speech Recognition (ASR) transcripts and Speaker Diarization (SD) outputs on Speech Emotion Recognition (SER) accuracy. Misalignment between these two modalities often reduces the reliability of multimodal emotion recognition systems, particularly in conversational contexts. To address this issue, we introduce an alignment pipeline utilizing pre-trained ASR and speaker diarization models, systematically synchronizing timestamps to generate accurately labeled speaker segments. Our multimodal approach combines textual embeddings extracted via RoBERTa with audio embeddings from Wav2Vec, leveraging cross-attention fusion enhanced by a gating mechanism. Experimental evaluations on the IEMOCAP benchmark dataset demonstrate that precise timestamp alignment improves SER accuracy, outperforming baseline methods that lack synchronization.
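The alignment step can be pictured as assigning each ASR word to the diarization segment with the largest temporal overlap, yielding speaker-labeled text for the downstream SER model; the input formats here are assumptions, not the paper's pipeline.

```python
# Sketch: label each timestamped ASR word with the best-overlapping speaker.
def align(words, segments):
    """words: (word, start, end); segments: (speaker, start, end)."""
    labeled = []
    for w, ws, we in words:
        best, best_ov = None, 0.0
        for spk, ss, se in segments:
            ov = max(0.0, min(we, se) - max(ws, ss))
            if ov > best_ov:
                best, best_ov = spk, ov
        labeled.append((w, best))   # best stays None if no segment overlaps
    return labeled

words = [("hello", 0.1, 0.5), ("there", 0.6, 0.9), ("hi", 1.2, 1.4)]
segs = [("spk1", 0.0, 1.0), ("spk2", 1.0, 2.0)]
print(align(words, segs))
# [('hello', 'spk1'), ('there', 'spk1'), ('hi', 'spk2')]
```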